Hierarchy of Study Designs For Evaluating
the Effectiveness of a STEM Education Project or Practice 



This document contains a narrative overview of the hierarchy, followed by a one-page graphic summary.

Purpose of the Hierarchy:

To help agency/program officials assess which study designs are capable of producing scientifically- 
valid evidence on the effectiveness of a STEM education project or practice (“intervention”¹).

More specifically, the hierarchy - 

■ Encompasses study designs whose purpose is to estimate an intervention’s effect on educational 
outcomes, such as student math/science achievement, or Ph.D. completion. (These are sometimes 
called “impact” studies.) The hierarchy does not apply to other types of studies that serve other 
purposes (e.g., implementation studies, longitudinal cohort studies).²

■ Recognizes that many designs, including less rigorous impact studies,³ can play a valuable role in
an overall research agenda. It is not meant to imply that rigorous impact studies are appropriate
for all interventions, or the only designs that produce useful knowledge.⁴

■ Is intended as a statement of general principles, and does not try to address all contingencies that 
may affect a study’s ability to produce valid results. 

Basis for the Hierarchy:

■ It is based on the best scientific evidence about which study designs are most likely to 
produce valid estimates of an intervention’s true effect - evidence that spans a range of fields 
such as education, welfare/employment, criminology, psychology, and medicine.⁵ This evidence
shows that many common study designs often produce erroneous conclusions, and can lead to 
practices that are ineffective or harmful. 

■ It is broadly consistent with the standards of evidence used by federal agencies and other 
authoritative organizations across a number of policy areas and contexts, including - 

- Department of Education⁶

- Department of Justice, Office of Justice Programs⁷

- Department of HHS, Substance Abuse and Mental Health Services Administration⁸

- Food and Drug Administration⁹

- Helping America’s Youth (a White House initiative)¹⁰

- Office of Management and Budget¹¹

- American Psychological Association¹²

- National Academy of Sciences, Institute of Medicine¹³

- Society for Prevention Research. 

Consistent with the hierarchy below, these various standards all recognize well-designed 
randomized controlled trials, where feasible, as the strongest design for evaluating an 
intervention’s effectiveness, and most recognize high quality comparison-group studies as the
best alternative when a randomized controlled trial is not feasible. 
HIERARCHY OF STUDY DESIGNS



I. A well-designed randomized controlled trial, where feasible, is generally the strongest
study design for evaluating an intervention’s effectiveness. 

A. Definition: Randomized controlled trials measure an intervention’s effect by randomly
assigning individuals (or groups of individuals) to an intervention group or a control group.

Randomized controlled trials are sometimes called “experimental” study designs. 

For example, suppose one wishes to evaluate, in a randomized controlled trial, whether providing 
struggling math students in third grade with supplemental one-on-one tutoring is more effective 
than simply providing them with the school’s existing math program. The study would randomly 
assign a sufficiently large number of third-grade students to either an intervention group, which 
receives the supplemental tutoring, or to a control group, which only receives the school’s
existing math program. The study would then measure the math achievement of both groups over 
time. The difference in math achievement between the two groups would represent the effect of 
the supplemental tutoring compared to the school’s existing program. 
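
The tutoring example above can be sketched as a small simulation. This is a minimal illustration, not an analysis plan: the function name `run_trial`, the score distribution, and the simulated true effect of 5 points are all assumptions made up for the sketch.

```python
import random
import statistics

def run_trial(n_students=2000, true_effect=5.0, seed=0):
    """Simulate a randomized controlled trial of supplemental tutoring.

    All numbers (score distribution, true_effect) are hypothetical.
    """
    rng = random.Random(seed)
    students = list(range(n_students))
    rng.shuffle(students)                    # random assignment
    tutored = set(students[: n_students // 2])  # intervention group
    scores = {"intervention": [], "control": []}
    for s in range(n_students):
        # Math achievement the student would reach under the existing program
        baseline = rng.gauss(50, 10)
        if s in tutored:
            scores["intervention"].append(baseline + true_effect)
        else:
            scores["control"].append(baseline)
    # Estimated effect = difference in mean achievement between the groups
    return (statistics.mean(scores["intervention"])
            - statistics.mean(scores["control"]))

if __name__ == "__main__":
    print(f"estimated effect: {run_trial():.2f} (simulated true effect: 5.0)")
```

With a sufficiently large sample, the estimate should land close to the simulated true effect; with a small `n_students`, chance differences between the groups can dominate.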

B. The unique advantage of random assignment: It enables you to assess whether the
intervention itself, as opposed to other factors, causes the observed outcomes. 

Specifically, the process of randomly assigning a sufficiently large number of individuals into 
either an intervention group or a control group ensures, to a high degree of confidence, that there 
are no systematic differences between the groups in any characteristics (observed and 
unobserved) except one - namely, the intervention group participates in the intervention, and the 
control group does not. Therefore, assuming the randomized controlled trial is properly carried 
out, the resulting difference in outcomes between the two groups can confidently be attributed to 
the intervention and not to other factors. 

By contrast, nonrandomized studies by their nature can never be entirely confident that they are 
comparing intervention participants to non-participants who are equivalent in observed and 
unobserved characteristics (e.g., motivation). Thus, these studies cannot rule out the possibility 
that such characteristics, rather than the intervention itself, are causing an observed difference in 
outcomes between the two groups. 
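
The contrast between the two designs can be illustrated with a simulation in which an unobserved characteristic (here labeled motivation) raises outcomes. This is a sketch under made-up assumptions: the function `compare_designs`, the outcome model, and the rule that only above-average-motivation students opt in are all hypothetical.

```python
import random
import statistics

def mean_difference(groups):
    """Mean outcome of participants minus mean outcome of non-participants."""
    return statistics.mean(groups[True]) - statistics.mean(groups[False])

def compare_designs(n=4000, true_effect=5.0, seed=1):
    """Return (randomized estimate, self-selected estimate) of the effect."""
    rng = random.Random(seed)
    randomized = {True: [], False: []}
    self_selected = {True: [], False: []}
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved by the evaluator
        base = 50 + 5 * motivation + rng.gauss(0, 5)
        # Design 1: a coin flip decides who participates
        in_program = rng.random() < 0.5
        randomized[in_program].append(base + (true_effect if in_program else 0.0))
        # Design 2: more motivated students choose to participate
        in_program = motivation > 0
        self_selected[in_program].append(base + (true_effect if in_program else 0.0))
    return mean_difference(randomized), mean_difference(self_selected)

if __name__ == "__main__":
    rct, nonrandom = compare_designs()
    print(f"randomized: {rct:.2f}, self-selected: {nonrandom:.2f}, true: 5.0")
```

In this setup the self-selected comparison overstates the effect, because it credits the intervention with the outcome gap that motivation alone produces.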

C. Random assignment alone does not ensure that a trial is well-designed and thus likely to
produce valid results; other key features of well-designed trials include the following:

■ Adequate sample size; 

■ Random assignment of groups (e.g., classrooms) instead of, or in addition to, individuals 
when needed to determine the intervention’s effect; 

■ Few or no systematic differences between the intervention and control groups prior to the 
intervention; 

■ Outcome data is obtained for the vast majority of sample members originally randomized 
(i.e., there is low sample “attrition”); 

■ Few or no control group members “cross over” to the intervention group after randomization, 
or otherwise benefit from the intervention (i.e., are “contaminated”); 

■ An analysis of study outcomes that is based on all sample members originally randomized, 
including those who fail to participate in the intervention (i.e., “intention-to-treat” analysis);
■ Outcome measures that are highly correlated with the true outcomes that the intervention
seeks to affect (i.e., “valid” outcome measures) - preferably well-established tests, and/or
objective, real-world measures (e.g., percent of students graduating with a STEM degree); 

■ Where appropriate, evaluators who are unaware of which sample members are in the 
intervention group versus the control group (i.e., “blinded” evaluators); 

■ Preferably long-term follow-up; 

■ Appropriate tests for statistical significance (in group-randomized trials, “hierarchical” tests 
that are based both on the number of groups and the number of individuals in each group); 

■ Preferably, evaluation of the intervention in more than one site and/or population - preferably 
schools/institutions and populations where the intervention would typically be implemented. 
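
One feature from the list above, intention-to-treat analysis, can be illustrated with a short simulation. It is a sketch under hypothetical assumptions: the function `itt_vs_as_treated`, the outcome model, and the rule that less motivated students assigned to the intervention fail to participate are invented for illustration. The intention-to-treat comparison groups students by random assignment; the "as-treated" comparison groups them by actual participation and so discards the protection that randomization provides.

```python
import random
import statistics

def mean_difference(groups):
    """Mean outcome of the True group minus mean outcome of the False group."""
    return statistics.mean(groups[True]) - statistics.mean(groups[False])

def itt_vs_as_treated(n=4000, true_effect=5.0, seed=2):
    """Return (intention-to-treat estimate, as-treated estimate)."""
    rng = random.Random(seed)
    by_assignment = {True: [], False: []}     # grouped as randomized
    by_participation = {True: [], False: []}  # grouped by what actually happened
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved by the evaluator
        base = 50 + 5 * motivation + rng.gauss(0, 5)
        assigned = rng.random() < 0.5
        # Hypothetical rule: less motivated students never show up
        participates = assigned and motivation > -0.5
        outcome = base + (true_effect if participates else 0.0)
        by_assignment[assigned].append(outcome)
        by_participation[participates].append(outcome)
    return mean_difference(by_assignment), mean_difference(by_participation)

if __name__ == "__main__":
    itt, as_treated = itt_vs_as_treated()
    print(f"intention-to-treat: {itt:.2f}, as-treated: {as_treated:.2f}")
```

Here the intention-to-treat estimate is smaller than the simulated effect on participants, since it measures the effect of being offered the intervention, but it compares groups that differ only by random assignment. The as-treated estimate is inflated because participants are, by construction, more motivated than non-participants.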



II. Well-matched comparison-group studies can be a second-best alternative when a 
randomized controlled trial is not feasible. 

A. Definition: A “comparison-group study” compares outcomes for intervention participants
with outcomes for a comparison group chosen through methods other than randomization.^^ 

Comparison-group studies are sometimes called “quasi-experimental” studies. 

For example, a comparison-group study might compare students participating in an intervention with 
students in neighboring schools who have similar demographic characteristics (e.g., age, sex, race, 
socioeconomic status) and educational achievement levels. 

B. Among comparison-group studies, those in which the intervention and comparison groups 
are very closely matched in key characteristics are most likely to produce valid results.

The evidence suggests that, in most cases, such well-matched comparison-group studies seem to 
yield correct overall conclusions about whether an intervention is effective, ineffective, or 
harmful. However, their estimates of the size of the intervention’s impact are still often 
inaccurate, possibly resulting in misleading conclusions about the intervention’s policy or 
practical significance. As an illustrative example, a well-matched comparison-group study might 
find that a class-size reduction program raises test scores by 40 percentile points - or, 
alternatively, by 5 percentile points - when its true effect is 20 percentile points. 

C. A full discussion of matching is beyond the scope of this paper, but a key principle is that the 
two groups should be closely matched on characteristics that may predict their outcomes. 

More specifically, in an educational study, it is generally important that the two groups be 
matched on characteristics that are often correlated with educational outcomes - characteristics 
such as students’ educational achievement prior to the intervention, demographics (e.g., age, sex, 
race, poverty level), geographic location, time period in which they are studied, and methods used 
to collect their outcome data. 

In addition, the study should preferably choose the intervention and matched comparison groups 
“prospectively” - i.e., before the intervention is administered. This is because if the intervention 
and comparison groups are chosen by the evaluator after the intervention is administered 
(“retrospectively”), the evaluator may consciously or unconsciously select the two groups so as to 
generate his or her desired results. Furthermore, it is often difficult or impossible for the reader 
of the study to determine whether the evaluator did so. 
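
The matching principle in II.C can be sketched in code. The example below is a minimal illustration that matches on a single observed characteristic, prior achievement, using greedy nearest-neighbor pairing. The function names, the data, and the outcome model are hypothetical, and a real study would match on several characteristics. Because the simulated outcome depends only on prior achievement, matching works well here; it offers no protection against unobserved differences.

```python
import random
import statistics

def nearest_neighbor_match(participants, pool):
    """Pair each participant with the unused pool member closest in prior score."""
    available = list(pool)
    pairs = []
    for person in participants:
        match = min(available, key=lambda c: abs(c["prior"] - person["prior"]))
        available.remove(match)  # match without replacement
        pairs.append((person, match))
    return pairs

def naive_vs_matched(seed=3, true_effect=5.0):
    """Return (unmatched estimate, matched estimate) for simulated students."""
    rng = random.Random(seed)

    def student(mean_prior, effect):
        prior = rng.gauss(mean_prior, 10)
        return {"prior": prior, "outcome": prior + effect + rng.gauss(0, 5)}

    # Hypothetical data: participants start with higher prior achievement
    participants = [student(60, true_effect) for _ in range(200)]
    pool = [student(50, 0.0) for _ in range(2000)]

    naive = (statistics.mean(s["outcome"] for s in participants)
             - statistics.mean(s["outcome"] for s in pool))
    pairs = nearest_neighbor_match(participants, pool)
    matched = statistics.mean(p["outcome"] - c["outcome"] for p, c in pairs)
    return naive, matched

if __name__ == "__main__":
    naive, matched = naive_vs_matched()
    print(f"unmatched: {naive:.2f}, matched: {matched:.2f}, true: 5.0")
```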
III. Other common study designs - including pre-post studies, and comparison-group
studies without careful matching - can be useful in generating hypotheses about
what works, but often produce erroneous conclusions.

A. A pre-post study examines whether participants in an intervention improve or become 
worse off during the course of the intervention, and then attributes any such improvement or 
deterioration to the intervention. 

B. The problem with a pre-post study is that, without reference to a comparison group, it cannot
answer whether participants’ improvement or deterioration would have occurred anyway,
even without the intervention.

This often leads to erroneous conclusions about the effectiveness of
the intervention. Such studies should therefore not be relied upon to inform policy decisions, but 
may still be useful in generating hypotheses about what works that merit confirmation in more 
rigorous studies (e.g., randomized controlled trials or well-matched comparison-group studies). 
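
The pre-post problem can be illustrated with a simulation in which the intervention has no effect at all but every student improves over the school year anyway. This is a sketch under made-up assumptions: the function `pre_post_vs_control`, the 8-point natural growth, and the score distributions are hypothetical.

```python
import random
import statistics

def pre_post_vs_control(n=2000, natural_growth=8.0, true_effect=0.0, seed=4):
    """Return (pre-post estimate, estimate using a randomized control group)."""
    rng = random.Random(seed)
    gains = {True: [], False: []}
    for _ in range(n):
        in_program = rng.random() < 0.5
        pre = rng.gauss(50, 10)
        # Every student grows over the year, with or without the intervention
        post = (pre + natural_growth
                + (true_effect if in_program else 0.0)
                + rng.gauss(0, 5))
        gains[in_program].append(post - pre)
    # A pre-post study sees only participants and credits all growth to the program
    pre_post_estimate = statistics.mean(gains[True])
    # A control group reveals how much growth would have occurred anyway
    controlled_estimate = pre_post_estimate - statistics.mean(gains[False])
    return pre_post_estimate, controlled_estimate

if __name__ == "__main__":
    pre_post, controlled = pre_post_vs_control()
    print(f"pre-post: {pre_post:.2f}, with control group: {controlled:.2f}, true: 0.0")
```

In this setup the pre-post estimate reports roughly the full natural growth as the program's effect, while the comparison against the control group is close to zero.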

C. Likewise, comparison-group studies without close matching often produce erroneous conclusions,
because of differences between the two groups that differentially affect their outcomes.

This is true even when statistical techniques (such as regression adjustment) are used to correct for 
observed differences between the two groups. Therefore, such studies - like pre-post studies - should 
not be relied upon to inform policy decisions, but may still be useful in hypothesis-generation. 

D. Despite their limitations, these less rigorous designs can play a key role in a larger research
agenda. One research strategy, for example, is to sponsor or conduct low-cost, less rigorous studies 
of a wide range of interventions, to identify areas where additional research, using more rigorous 
methods, is warranted. 
Graphic Summary:

Hierarchy of Study Designs For Evaluating the Effectiveness of a STEM Educational Intervention 




* A randomized controlled trial is sometimes called an “experimental” study. 
** A comparison-group study is sometimes called a “quasi-experimental” study. 







